86 research outputs found
AutoAccel: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
CPU-FPGA heterogeneous architectures are attracting ever-increasing attention
in an attempt to advance computational capabilities and energy efficiency in
today's datacenters. These architectures provide programmers with the ability
to reprogram the FPGAs for flexible acceleration of many workloads.
Nonetheless, this advantage is often overshadowed by the poor programmability
of FPGAs, which conventionally must be programmed through RTL design. Although
recent advances in high-level synthesis (HLS) significantly improve FPGA
programmability, programmers still face the challenge of identifying the
optimal design configuration in a tremendous design space.
This paper aims to address this challenge and pave the path from software
programs towards high-quality FPGA accelerators. Specifically, we first propose
the composable, parallel and pipeline (CPP) microarchitecture as a template of
accelerator designs. Such a well-defined template is able to support efficient
accelerator designs for a broad class of computation kernels, and more
importantly, drastically reduce the design space. Also, we introduce an
analytical model to capture the performance and resource trade-offs among
different design configurations of the CPP microarchitecture, which lays the
foundation for fast design space exploration. On top of the CPP
microarchitecture and its analytical model, we develop the AutoAccel framework
to automate the entire accelerator generation process. AutoAccel accepts a
software program as an input and performs a series of code transformations
based on the result of the analytical-model-based design space exploration to
construct the desired CPP microarchitecture. Our experiments show that the
AutoAccel-generated accelerators outperform their corresponding software
implementations by an average of 72x for a broad class of computation kernels.
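The analytical-model-driven exploration that AutoAccel describes can be sketched in miniature: enumerate candidate configurations of a parallel/pipelined template, analytically prune those exceeding the resource budget, and keep the lowest-latency survivor. The latency and resource models below (and the names `pe_count`, `pipeline_ii`, the budget constants) are illustrative assumptions, not AutoAccel's actual equations.

```python
from itertools import product

# assumed resource budgets of a hypothetical FPGA, and the kernel's trip count
DSP_BUDGET = 2048
BRAM_BUDGET = 1024
TRIP_COUNT = 1_000_000

def latency_cycles(pe_count, pipeline_ii):
    # parallel PEs divide the trip count; the pipeline initiation
    # interval (II) stretches each iteration
    return (TRIP_COUNT // pe_count) * pipeline_ii

def resources(pe_count, pipeline_ii):
    # toy cost model: each PE costs DSPs and BRAMs, and a lower II
    # costs extra DSPs for duplicated functional units
    dsp = pe_count * (16 + 8 // pipeline_ii)
    bram = pe_count * 4
    return dsp, bram

def explore():
    best = None
    for pe, ii in product([1, 2, 4, 8, 16, 32, 64], [1, 2, 4]):
        dsp, bram = resources(pe, ii)
        if dsp > DSP_BUDGET or bram > BRAM_BUDGET:
            continue  # analytically prune infeasible configurations
        lat = latency_cycles(pe, ii)
        if best is None or lat < best[0]:
            best = (lat, pe, ii)
    return best

best_latency, best_pe, best_ii = explore()
```

Because every candidate is evaluated by closed-form formulas rather than synthesis runs, the whole space is swept in microseconds, which is what makes model-based design space exploration fast.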
Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs
As deep learning models nowadays are widely adopted by both cloud services
and edge devices, reducing the latency of deep learning model inferences
becomes crucial to provide efficient model serving. However, it is challenging
to develop efficient tensor programs for deep learning operators due to the
high complexity of modern accelerators and the rapidly growing number of
operators. Deep learning compilers, such as Apache TVM, adopt declarative
scheduling primitives to lower the bar of developing tensor programs. However,
we show that this approach is insufficient to cover state-of-the-art tensor
program optimizations. In this paper, we propose to embed the scheduling
process into tensor programs and use dedicated mappings, called task mappings,
to define the computation assignment and ordering. This new approach greatly
enriches the expressible optimizations by allowing developers to manipulate
tensor programs at a much finer granularity. We call the proposed method the
task-mapping programming paradigm. In addition, we propose a new
post-scheduling fusion optimization that allows developers to focus on
scheduling every single operator and automates the fusion after scheduling. It
greatly reduces the engineering efforts for operator fusion. Our proposed
paradigm also constructs an efficient hardware-centric schedule space, which is
agnostic to the program input size and greatly reduces the tuning time. With
the proposed paradigm, we implement a deep learning compiler Hidet. Extensive
experiments on modern convolution and transformer models show that Hidet
outperforms the state-of-the-art DNN inference framework ONNX Runtime and the
compiler TVM (equipped with the schedulers AutoTVM and Ansor) by up to 1.48x
(1.22x on average). It also reduces the tuning time by 20x and 11x compared
with AutoTVM and Ansor, respectively. We open-sourced Hidet at
https://www.github.com/hidet-org/hidet.

Comment: 15 pages, 22 figures, 1 table
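The task-mapping idea can be illustrated with a dependency-free sketch: a mapping tells each worker (e.g., a GPU thread) which tile coordinates it computes and in what order, and mappings compose. The classes and names below (`TaskMapping`, `spatial`, `repeat`) are an illustrative reimplementation loosely modeled on the paper's description, not Hidet's actual API.

```python
class TaskMapping:
    """A mapping from workers to the tile coordinates they compute, in order."""
    def __init__(self, rows, cols, num_workers, fn):
        self.rows, self.cols, self.num_workers = rows, cols, num_workers
        self._fn = fn

    def tasks_of(self, worker):
        return self._fn(worker)

    def __mul__(self, inner):
        # compose: the outer mapping picks a tile, the inner mapping lays
        # out the tasks within that tile
        outer = self

        def fn(w):
            ow, iw = divmod(w, inner.num_workers)
            return [(oi * inner.rows + ii, oj * inner.cols + ij)
                    for (oi, oj) in outer.tasks_of(ow)
                    for (ii, ij) in inner.tasks_of(iw)]

        return TaskMapping(outer.rows * inner.rows, outer.cols * inner.cols,
                           outer.num_workers * inner.num_workers, fn)

def spatial(rows, cols):
    # one task per worker, workers laid out row-major
    return TaskMapping(rows, cols, rows * cols, lambda w: [divmod(w, cols)])

def repeat(rows, cols):
    # a single worker iterates over the whole grid sequentially
    return TaskMapping(rows, cols, 1,
                       lambda w: [(i, j) for i in range(rows)
                                  for j in range(cols)])
```

Composing `spatial(2, 2) * repeat(2, 2)`, for instance, yields four workers that each sweep their own 2x2 sub-tile of a 4x4 grid; manipulating the program at this granularity is what the paradigm means by fine-grained scheduling.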
Decoupled Model Schedule for Deep Learning Training
Recent years have seen an increase in the development of large deep learning
(DL) models, which makes training efficiency crucial. Common practice
struggles with the trade-off between usability and performance. On one hand,
DL frameworks such as PyTorch use dynamic graphs to facilitate model developers
at a price of sub-optimal model training performance. On the other hand,
practitioners propose various approaches to improving the training efficiency
by sacrificing some of the flexibility, ranging from making the graph static
for more thorough optimization (e.g., XLA) to customizing optimization towards
large-scale distributed training (e.g., DeepSpeed and Megatron-LM).
In this paper, we aim to address the tension between usability and training
efficiency through separation of concerns. Inspired by DL compilers that
decouple the platform-specific optimizations of a tensor-level operator from
its arithmetic definition, this paper proposes a schedule language to decouple
model execution from definition. Specifically, the schedule works on a PyTorch
model and uses a set of schedule primitives to convert the model for common
model training optimizations such as high-performance kernels, effective 3D
parallelism, and efficient activation checkpointing. Compared to existing
optimization solutions, we optimize the model as needed through high-level
primitives, thus largely preserving programmability and debuggability for
users. Our evaluation results show that by scheduling the existing
hand-crafted optimizations in a systematic way, we are able to improve training
throughput by up to 3.35x on a single machine with 8 NVIDIA V100 GPUs, and by
up to 1.32x on multiple machines with up to 64 GPUs, when compared to the
out-of-the-box performance of DeepSpeed and Megatron-LM.
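The decoupling the paper describes can be sketched with a toy, PyTorch-free stand-in: the model definition stays untouched while a schedule object applies primitives such as replacing a submodule with a fused version or marking one for activation checkpointing. All class and primitive names here are hypothetical, not the paper's actual API.

```python
class Module:
    """Toy stand-in for a framework module; the real schedule targets PyTorch."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = dict(children or {})
        self.checkpointed = False

class Schedule:
    """Holds optimization decisions separately from the model definition."""
    def __init__(self, model):
        self.model = model

    def _get(self, path):
        mod = self.model
        for part in path.split("."):
            mod = mod.children[part]
        return mod

    def replace(self, path, new_module):
        # e.g., swap a vanilla attention block for a fused-kernel wrapper
        parent_path, _, leaf = path.rpartition(".")
        parent = self._get(parent_path) if parent_path else self.model
        parent.children[leaf] = new_module

    def checkpoint(self, path):
        # mark a submodule for activation checkpointing
        self._get(path).checkpointed = True

model = Module("transformer", {"attn": Module("attention"),
                               "mlp": Module("mlp")})
sch = Schedule(model)
sch.replace("attn", Module("fused_attention"))  # definition stays untouched
sch.checkpoint("mlp")
```

The point of the separation is visible even in this toy: deleting the two `sch.*` lines restores the original, fully debuggable model, since all optimizations live in the schedule rather than in the model code.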
TensorIR: An Abstraction for Automatic Tensorized Program Optimization
Deploying deep learning models on various devices has become an important
topic. The wave of hardware specialization brings a diverse set of acceleration
primitives for multi-dimensional tensor computations. These new acceleration
primitives, along with the emerging machine learning models, bring tremendous
engineering challenges. In this paper, we present TensorIR, a compiler
abstraction for optimizing programs with these tensor computation primitives.
TensorIR generalizes the loop nest representation used in existing machine
learning compilers to make tensor computation a first-class citizen.
Finally, we build an end-to-end framework on top of our abstraction to
automatically optimize deep learning models for given tensor computation
primitives. Experimental results show that TensorIR compilation automatically
uses the tensor computation primitives for given hardware backends and delivers
performance that is competitive with state-of-the-art hand-optimized systems
across platforms.

Comment: Accepted to ASPLOS 2023
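The core move TensorIR enables, replacing a scalar inner loop nest with a hardware tensor primitive while keeping the outer tiling loops, can be sketched in plain Python. The `mma_2x2` "intrinsic" below is a mock stand-in for a real acceleration primitive; the names are ours, not TensorIR's.

```python
def mma_2x2(C, A, B, i0, j0, k0):
    # mock 2x2x2 matrix-multiply-accumulate "hardware intrinsic"
    for i in range(2):
        for j in range(2):
            for k in range(2):
                C[i0 + i][j0 + j] += A[i0 + i][k0 + k] * B[k0 + k][j0 + j]

def matmul_tensorized(A, B, n):
    # outer loops walk over 2x2x2 tiles; the scalar inner loop nest has
    # been replaced ("tensorized") by a call to the intrinsic
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, 2):
        for j0 in range(0, n, 2):
            for k0 in range(0, n, 2):
                mma_2x2(C, A, B, i0, j0, k0)
    return C
```

A compiler performing this rewrite must prove that the tile-level region computed by the intrinsic matches the original scalar semantics, which is the correctness obligation TensorIR's block abstraction makes explicit.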
Ansor: Generating High-Performance Tensor Programs for Deep Learning
High-performance tensor programs are crucial to guarantee efficient execution
of deep neural networks. However, obtaining performant tensor programs for
different operators on various hardware platforms is notoriously challenging.
Currently, deep learning systems rely on vendor-provided kernel libraries or
various search strategies to get performant tensor programs. These approaches
either require significant engineering effort to develop platform-specific
optimization code or fall short of finding high-performance programs due to
restricted search space and ineffective exploration strategy.
We present Ansor, a tensor program generation framework for deep learning
applications. Compared with existing search strategies, Ansor explores many
more optimization combinations by sampling programs from a hierarchical
representation of the search space. Ansor then fine-tunes the sampled programs
with evolutionary search and a learned cost model to identify the best
programs. Ansor can find high-performance programs that are outside the search
space of existing state-of-the-art approaches. In addition, Ansor utilizes a
task scheduler to simultaneously optimize multiple subgraphs in deep neural
networks. We show that Ansor improves the execution performance of deep neural
networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA
GPU by up to 3.8x, 2.6x, and 1.7x, respectively.

Comment: Published in OSDI 2020
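Ansor's two-stage search, sampling candidates from a large space and then refining them with evolutionary mutations ranked by a cost model, can be sketched as follows. The search space (pairs of tiling factors) and the cost function (a synthetic proxy rewarding tilings that cover a 64-element axis exactly) are illustrative stand-ins, not Ansor's hierarchical space or learned model.

```python
import random

random.seed(0)
FACTORS = [1, 2, 4, 8, 16, 32]

def cost(candidate):
    # synthetic proxy: tilings whose factors exactly cover a 64-element
    # axis (tx * ty == 64) score best; a learned model would go here
    tx, ty = candidate
    return abs(tx * ty - 64)

def sample(n):
    # stage 1: draw random candidates from the (here, tiny) search space
    return [(random.choice(FACTORS), random.choice(FACTORS)) for _ in range(n)]

def mutate(candidate):
    # stage 2: perturb one decision of a promising candidate
    tx, ty = candidate
    if random.random() < 0.5:
        tx = random.choice(FACTORS)
    else:
        ty = random.choice(FACTORS)
    return (tx, ty)

def evolutionary_search(rounds=50, pop=32, keep=8):
    population = sample(pop)
    for _ in range(rounds):
        population.sort(key=cost)        # rank by the cost model
        parents = population[:keep]      # retain the most promising
        population = parents + [mutate(random.choice(parents))
                                for _ in range(pop - keep)]
    return min(population, key=cost)

best = evolutionary_search()
```

Sampling whole candidates first, instead of growing one decision at a time, is what lets this style of search reach programs outside a template-restricted space.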
CXCR5+ follicular cytotoxic T cells control viral infection in B cell follicles
During unresolved infections, some viruses escape immunological control and establish a persistent reservoir in certain cell types; human immunodeficiency virus (HIV), for example, persists in follicular helper T cells (TFH cells), and Epstein-Barr virus (EBV) persists in B cells. Here we identified a specialized group of cytotoxic T cells (TC cells) that expressed the chemokine receptor CXCR5, selectively entered B cell follicles and eradicated infected TFH cells and B cells. The differentiation of these cells, which we have called 'follicular cytotoxic T cells' (TFC cells), required the transcription factors Bcl6, E2A and TCF-1 but was inhibited by the transcriptional regulators Blimp1, Id2 and Id3. Blimp1 and E2A directly regulated Cxcr5 expression and, together with Bcl6 and TCF-1, formed a transcriptional circuit that guided TFC cell development. The identification of TFC cells has far-reaching implications for the development of strategies to control infections that target B cells and TFH cells and to treat B cell–derived malignancies.
Prevalence, associated factors and outcomes of pressure injuries in adult intensive care unit patients: the DecubICUs study
Funder: European Society of Intensive Care Medicine (doi: http://dx.doi.org/10.13039/501100013347); Funder: Flemish Society for Critical Care Nurses.
Abstract: Purpose: Intensive care unit (ICU) patients are particularly susceptible to developing pressure injuries, yet international epidemiologic data are lacking. We aimed to provide an international picture of the extent of pressure injuries and of the factors associated with ICU-acquired pressure injuries in adult ICU patients. Methods: International 1-day point-prevalence study; follow-up for outcome assessment until hospital discharge (maximum 12 weeks). Factors associated with ICU-acquired pressure injury and hospital mortality were assessed by generalised linear mixed-effects regression analysis. Results: Data from 13,254 patients in 1117 ICUs (90 countries) revealed 6747 pressure injuries; 3997 (59.2%) were ICU-acquired. Overall prevalence was 26.6% (95% confidence interval [CI] 25.9–27.3). ICU-acquired prevalence was 16.2% (95% CI 15.6–16.8). The sacrum (37%) and heels (19.5%) were most affected. Factors independently associated with ICU-acquired pressure injuries were older age, male sex, being underweight, emergency surgery, higher Simplified Acute Physiology Score II, Braden score < 19, ICU stay > 3 days, comorbidities (chronic obstructive pulmonary disease, immunodeficiency), organ support (renal replacement, mechanical ventilation on ICU admission), and being in a low or lower-middle income economy. Gradually increasing associations with mortality were identified for increasing severity of pressure injury: stage I (odds ratio [OR] 1.5; 95% CI 1.2–1.8), stage II (OR 1.6; 95% CI 1.4–1.9), and stage III or worse (OR 2.8; 95% CI 2.3–3.3). Conclusion: Pressure injuries are common in adult ICU patients. ICU-acquired pressure injuries are associated mainly with intrinsic factors and with mortality. Optimal care standards, increased awareness, appropriate resource allocation, and further research into optimal prevention are pivotal to tackling this important patient safety threat.
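The prevalence arithmetic reported above can be roughly checked with a simple normal-approximation (Wald) interval. This is a back-of-the-envelope sketch, not the study's generalised mixed-effects method, so the bounds differ slightly from the reported 25.9–27.3.

```python
import math

def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation 95% CI for a proportion (not the study's method)."""
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

n = 13_254   # patients analysed in the study
p = 0.266    # reported overall prevalence (26.6%)
lo, hi = wald_ci(p, n)
# yields roughly 25.8%-27.4%, close to the reported 25.9-27.3; the study's
# mixed-effects analysis, which accounts for clustering by ICU, differs slightly
```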